ABSTRACT

Observer disagreement is important insofar as it limits the reliabilities of observational records. This discussion revolves around methods and conditions under which observer agreement can be measured so as to minimize such an occurrence. Observers should be trained to nearly perfect agreement with a criterion or expert coder on unambiguous examples of behavioral categories before actual data collection. Disagreement on ambiguities may help reflect a more accurate representation of the real world. In addition to criterion-related agreement, it is suggested that intraobserver agreement be obtained by showing a videotape twice to all observers in which conditions parallel those encountered in the field. While criterion-related and intraobserver agreement measures have been recommended for both before and during a study, they should not be used as evidence of observer agreement in the actual classroom, but rather to assist an investigator in documenting adequacy of observational skills. After a study is finished, reliabilities of observational data and coefficients of stability and observer agreement should be calculated by using intraclass correlation coefficients. (Author/RC)
While there is a plethora of observation systems currently employed in research in teacher behavior (Simon & Boyer, 1970), there is a corresponding paucity of evidence relating specific teaching skills to changes in pupil classroom behavior (Smith, 1971). One possible reason for many insignificant findings may be that investigators have inadequately controlled for a number of sources of error associated with observational data (McGaw, Wardrop, & Bunda, 1972).

Prevalent confusion concerning reliabilities of observational records can be traced to a failure to separate two statistically related but conceptually different measures: observer agreement and reliabilities of observational records (Medley & Mitzel, 1958, 1963; McGaw et al., 1972). It is generally agreed that the reliability of a test is a necessary but not a sufficient condition for determining concurrent or predictive validity. Analogously, observer agreement is a primary issue, although not the most important one, to be faced in interpretation of results of an observational study. More important are the reliabilities of the observational data, i.e., the extent to which observational records discriminate among teachers, pupils, and situations within classroom environments. Observer disagreement is important insofar as it acts as a limiting factor on reliabilities of observational data.

Reliabilities of teacher/pupil behavior and other classroom process variables are especially important in studies attempting to relate these variables to outcome measures such as pupil growth (Soar, 1972). Unreliability of either or both measures will tend to obscure any significant relationships that may exist. Moreover, reliabilities of classroom process variables are ultimately essential for the purpose of generalization about relationships among teacher and pupil behaviors. Given that certain pupil outcomes are predictably associated with specific teaching strategies, such information can be communicated to teachers for use in daily decision-making.

In dealing with traditional measuring instruments, reliability has been classically defined as the consistency with which the instrument measures something. Typically this has been done by using parallel forms of a test, test-retest methods, or methods of internal consistency. A number of assumptions, however, are made about this parallel measures concept. That is, two or more tests are assumed to be equivalent in content, means, variances, and intercorrelations of items (Cronbach, Rajaratnam, & Gleser, 1963).

While these assumptions are seldom fully met in practice when using traditional tests, they become impractical when applied to observational studies where human raters (who take the place of tests) are rarely identical or equivalent in their observational skills.

In order to avoid such assumptions a number of statisticians have proposed the use of intraclass correlation coefficients as a means of determining reliabilities (Haggard, 1958; Cronbach et al., 1963; Gleser, Cronbach, & Rajaratnam, 1965; Medley & Mitzel, 1958, 1963; McGaw et al., 1972). There has been general agreement about the formula for a reliability in terms of population parameters.

That is:

\[ \rho_{tt} = \frac{\sigma_t^2}{\sigma_x^2} \]

where:

$\sigma_t^2$ is defined as the variance of the true scores (e.g., of teacher behavior) around the mean of all such true scores in the population of teachers represented by the sample of teachers actually observed.

And:

$\sigma_x^2$ is not as easily defined. The definition of $\sigma_x^2$ will vary according to the procedure used to collect the data. Basically, $\sigma_x^2$ represents the variance of the obtained measures on all the teachers in the population about their own mean (Medley & Mitzel, 1963).

While specific components of variance included in $\sigma_x^2$ are defined by the design of a study and an investigator's particular interests, the obtained variance can be considered to consist of a true variance component, $\sigma_t^2$, and a generic error variance component, $\sigma_e^2$ (McGaw et al., 1972). Likewise, some components of generic error variance are determined by the nature of the reliability coefficient of interest. However, observer disagreement, or error variance attributable to observers, is almost always considered to be a part of the generic error variance when computing reliabilities of observational records. Thus, minimal observer disagreement is a necessary but insufficient condition for high reliability coefficients, since there are other components of the generic error variance which are theoretically independent from observer error variance (e.g., intra-subject variance from occasion to occasion).
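The point that low observer error is necessary but not sufficient can be illustrated numerically. The sketch below assumes that the obtained variance is simply the sum of the true variance and independent error components; the function name and the variance figures are hypothetical and are not taken from any study cited here.

```python
def reliability(true_var, error_components):
    """Ratio of true variance to obtained variance.

    Assumes obtained variance = true variance + sum of independent
    error components (an illustrative simplification).
    """
    error_var = sum(error_components.values())
    obtained_var = true_var + error_var
    return true_var / obtained_var


# Hypothetical variance components, chosen only for illustration.
with_observer_error = reliability(40.0, {"observers": 10.0, "occasions": 30.0})
without_observer_error = reliability(40.0, {"observers": 0.0, "occasions": 30.0})

print(round(with_observer_error, 2))     # 0.5
print(round(without_observer_error, 2))  # 0.57
```

Even with observer error removed entirely, the hypothetical occasion-to-occasion variance keeps the coefficient well below 1.00.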

Other sources of error, in addition to observer disagreement, can contribute to unreliability of observational records. Instability of teacher/pupil behavior from occasion to occasion is typically the greatest source of error (Medley & Mitzel, 1963). Poorly designed observation systems and studies can also contribute to unreliability. Finally, inappropriate data analysis procedures can obfuscate actual differences among classrooms. Although human behavior is frequently unstable across separate occasions, it is possible to minimize observer errors and investigator errors.

Observer Errors

Observer agreement is typically calculated by comparing observational records of two or more observers with each other or with an expert when simultaneously coding the same classroom events. Observer agreement is, however, not synonymous with reliabilities of observational records, as is often mistakenly assumed. For example, observers can be trained to a high level of agreement, yet they can collect very unreliable data if the behaviors of the observed teachers/pupils differ little, or if behaviors are truly unstable from occasion to occasion. That is, if variance between subjects (true variance) is small relative to variance within subjects (error variance), the measurement will be unreliable regardless of the extent of observer agreement.

Nonetheless, observer disagreement cannot be totally ignored. If it cannot be documented that observers adequately agreed on the same behavioral events upon completion of training, and if the data they collected proved to be unreliable, a critical paradox is faced: Is the source of unreliability of observational records due primarily to lack of observer agreement or consistency in the field? Or is the source of unreliability largely due to a lack of discriminable differences among teachers/pupils, either because behaviors are too unstable or because there are no existing differences among subjects?

If adequate observer agreement cannot be demonstrated, observational data may be confounded with errors attributable to observer misunderstandings of category definitions, inconsistencies, and biases which are realistically inseparable from actual inconsistencies (error variance) within teachers/pupils and true variation among teachers/pupils in classroom observation research designs.



Once a study is finished, the extent to which observer errors detract from reliabilities can be estimated by intraclass correlation coefficients, providing certain design requirements are met (Medley & Mitzel, 1963). It is too late at this point, however, to retrain observers if observer disagreement is found to be seriously limiting reliabilities. Hence, methods and conditions under which observer agreement can be measured so as to minimize such an occurrence are of prime concern.

Observer Agreement When?




Perhaps the first issue to be addressed is when measures of observer agreement can be made so as to minimize the possibility that it is observers who are the prime source of potentially unreliable observational data.

This suggests that a researcher should demonstrate that observers were adequately trained before actual data collection. Even if this prior agreement check was accomplished under conditions identical to those of actual data collection,¹ it is still no guarantee that observer skills will not deteriorate during data collection. It follows that checks should be made to ensure that observers are maintaining their skills during the study as well. If certain observers are found to be deteriorating in their skills at an early stage in the study, remediation or deletion is possible without the loss of the entire study.

The frequency of maintenance tests largely depends on how much a researcher is willing to gamble against the possibility of obtaining results which are unreliable primarily because of observer disagreement. Of course, if results are reliable, observer agreement becomes a moot point unless there is a reason to suspect that observers are biased in some manner or that a "halo effect" may be contributing to reliability (see Medley & Mitzel, 1963).

¹The authors do not intend to imply that this is the most desirable means of measuring observer agreement, as will be discussed later.



The number of maintenance trials also depends on other factors such as the complexity of the observation system, the design of the study, the contingencies under which observers code, the length of the study, the frequency of observations for each observer, scheduling problems, economic considerations, etc. It is wise to schedule a maintenance check some time early in the study in order to reduce the likelihood of collecting large amounts of data ridden with observer errors. At minimum, an additional test near the end of the study is recommended.

Agreement on What Kinds of Data?

Medley (personal communication, 1972) has noted that one logical consideration for determining observer agreement is that of how the data are to be analyzed. Although this may appear obvious, it is sometimes overlooked. For example, Flanders (1967) calculates observer agreement using a modification of Scott's $\pi$ index (1955) based on total frequencies across all categories, comparing two observers at a time. Yet in data analysis he utilizes two-stage behavior chains in his matrices. It is conceivable that observers could agree very well in overall category totals, yet profoundly disagree on identification of two-stage behavior patterns.

If data are to be analyzed by frequencies of individual categories emitted, it is suggested that observer agreement measures be determined on total frequencies for each category. If comparisons of frequencies of groups of categories (scales) are planned, it is suggested that observer agreement measures be based on total scale frequencies. Or, if analysis of three-stage patterns or chains of behavior is intended, computation of observer agreement on the basis of three-stage chains is recommended.

In short, observer agreement should be computed on the same unit(s) of behavior that will be used in later analyses.
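The distinction between category totals and behavior chains can be made concrete with a small sketch. The two code sequences below are hypothetical, as are the helper names; the example only shows that identical category totals do not imply identical two-stage chains.

```python
from collections import Counter


def category_totals(codes):
    """Tally how often each category was coded."""
    return Counter(codes)


def two_stage_chains(codes):
    """Tally adjacent (preceding, following) pairs of codes."""
    return Counter(zip(codes, codes[1:]))


# Hypothetical code sequences from two observers watching the same events.
observer_a = [1, 2, 1, 2, 3, 3]
observer_b = [1, 1, 2, 2, 3, 3]

same_totals = category_totals(observer_a) == category_totals(observer_b)
chains_a = two_stage_chains(observer_a)
chains_b = two_stage_chains(observer_b)
# Counter intersection keeps the minimum count of each shared chain.
shared_chains = sum((chains_a & chains_b).values())
total_chains = sum(chains_a.values())

print(same_totals)                  # True
print(shared_chains, total_chains)  # 3 5
```

Here the observers match perfectly on category totals but share only three of five two-stage chains, which is the situation the text warns about.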




Agreement with Whom?

As was stated earlier, observer agreement is typically calculated by comparing observational records of two or more observers with each other or with an "expert" when coding the same classroom events at the same time. Assuming that an analysis based on categorical frequencies was of interest, an interobserver agreement estimate (across all observers) could be obtained for each category by using intraclass correlation coefficients (Ebel, 1951; Haggard, 1958; Medley & Mitzel, 1958, 1963).

Suppose a coefficient of .85 or greater for each category is considered to be satisfactory evidence of adequate observer training. Aside from problems of interpretation which will be discussed later, other considerations are paramount. What if coefficients are much lower than .85 on several of the categories? Does this mean that these categories are not clearly defined? Or does it mean that some of the observers do not clearly understand them? If so, which ones are the bad coders? Are they the ones who deviate most widely from the rest of the group? Conversely, are these "deviates" the good coders, and the remainder of the group making a common mistake? Moreover, even on categories with coefficients greater than .85, such high interobserver agreement does not necessarily reflect high agreement with the original definitions of these categories.

In attempting to reduce the above-mentioned problems of using an interobserver agreement measure, it seems advantageous to compare each observer's scores to those of a criterion or an expert coder in whom the researcher has confidence for strictly, consistently, and objectively following original category definitions. This type of observer agreement is referred to as criterion-related agreement. Such an agreement measure is more useful than an interobserver agreement measure when decisions about adequacy of individual observer skills are paramount.



Moreover, criterion-related observer agreement lends itself well to the design of instructional materials for observer training (Thiagarajan, 1973). One of the problems in drawing conclusions from observational studies relating teacher behavior to pupil growth is that of comparing findings from independent studies in which information about the behavioral variables is inadequate (Soar, 1972). Criterion-referenced observer agreement measures within the context of an observation system instructional package could help reduce this problem.

Agreement Under What Conditions and How Perfect?



It has been implied that perfect observer agreement is desirable, but the conditions under which this applies have not been specified, nor have the conditions been explicated under which observer disagreement is desirable.

Medley and Norton (1971) have contended that studies performed in the field to determine observer agreement before actual observational data collection should be discontinued. Rather, investigators need only document that their observers were competent upon completion of training, with competency being determined via nearly perfect observer agreement on unambiguous examples of behavioral categories shown on videotape. Conversely, they have argued that perfect observer agreement during actual data collection may not be particularly desirable. Since teachers and pupils in the real world do not always exhibit behaviors that neatly fall into predefined observational system categories, observer disagreement on ambiguities reveals a more representative picture of that real world.

Although their argument is initially disconcerting in that it may appear inconsistent, it does make sense from a practical standpoint. It is highly improbable that any observation system has such specifically defined and perfectly mutually exclusive categories that every behavioral event that occurs can be clearly fitted into one of its categories. In all likelihood there will be some teacher/pupil behaviors which are ambiguous, i.e., they contain elements of two or more categories in the system. If observers are brainwashed to the point that they consistently code the same ambiguous behaviors into a certain category, results could be biased. Alternatively, if one observer codes an ambiguous behavior into one category and another observer codes the same ambiguous behavior into a different category, the overall results may indicate a more realistic description of that teacher's behavior. That is, in the latter case there will be some tallies in both categories rather than in only one category as in the former case.

Medley and Norton (1971) have suggested that a videotape containing solely unambiguous examples of each category for measuring observer agreement should be constructed. This could be done by taping a variety of live classrooms. A panel of judges could then review these tapes and select a number of isolated, unambiguous examples to be dubbed onto another tape. This process, however, can be quite time-consuming and uneconomical. Moreover, the video and audio quality of such tapes tends to be rather poor. It is also sometimes difficult to find an adequate number of examples of certain behavioral categories which occur infrequently in actual classrooms.

An alternative approach is to videotape classroom simulations of isolated examples. This can be done in an environment where acceptable audio-visual tape quality can be maintained. Furthermore, examples can be randomly ordered, while producing only one original master tape with segment signs, directions, etc. concurrently recorded.

The issue that this type of observer agreement measure lacks validity could be raised. That is, observers are not being tested under the conditions to which they are exposed during actual observation. While this is true, it should be noted that the purpose of such a criterion-related agreement measure is to test an observer's knowledge of the items on the observation instrument. If observer agreement is measured under actual observation conditions, it is usually impossible to separate unwanted observer errors in coding unambiguous examples from expected and desirable observer disagreement on ambiguous ones. Therefore, it will be difficult to establish an acceptable level of agreement for making decisions about individual observer competencies.

As a compromise on this issue of validity, observers can also be tested under more realistic conditions in addition to that of coding isolated unambiguous examples. Because of the above-mentioned paradox, however, it is suggested that observer agreement measures be interpreted in a different manner. While nearly perfect observer agreement with a criterion is expected for clear-cut examples, a different method and standard is recommended for measures taken under actual coding conditions.

The proposed additional method is that of showing a videotape of realistic conditions twice to all observers. This tape should probably be about the same length as a normal observation period in a study, contain numerous examples of each category, and be fairly representative of the conditions under which actual observation takes place. Rather than focusing on agreement with a criterion or with other observers, the issue of individual observer consistency can be addressed. The extent to which each observer is consistent with himself can be measured by comparing results from the first viewing to those of the second. Biases in coding ambiguous events that exist for each coder are expected to remain fairly consistent from viewing to viewing, and thus moderately high intraobserver agreement or consistency is demanded.

Again, a videotape simulation of classroom situations is advantageous, since it is usually easier to control audio-visual quality and provide an adequate number of examples of each category than it is in taping live classrooms.

While this paper stresses methods of making decisions about adequacy of observer training, observer vigilance is an equally important problem. Although observers can be trained quite well with carefully designed instructional packages (Thiagarajan, 1973) or with a computer-assisted consensus coding schema (Semmel, 1972), there is no guarantee that those who perform well on a criterion test or demonstrate high intraobserver agreement scores will perform well as coders in the field. Strategies for coping with observer vigilance lie beyond the scope of this paper. However, lack of observer vigilance can be a significant source of error that limits reliabilities of observational records.
How Can Agreement Be Measured? 

Since there is probably no one best way of calculating observer agreement, a number of alternatives will be discussed. It should be noted that measures of observer agreement will be considered only in relation to criterion-related and intraobserver agreement checks. In the former an individual observer's codes on unambiguous, isolated examples are compared to a criterion in order to make decisions about the observer's knowledge of the system. In the latter test, an individual observer's codes from one viewing of a videotape of realistic classroom conditions are compared to the same observer's codes from a second viewing of the same videotape. In both types of agreement situations only two observers are compared at a time.

Intraclass Correlation Coefficients. Previous mention was made concerning the use of intraclass correlation coefficients for determining interobserver agreement (Ebel, 1951; Medley & Mitzel, 1958). While it was pointed out that measures across all observers are impractical for making decisions about individual observers, one other problem is encountered with this method.

An intraclass correlation coefficient is an estimate of the ratio of true variance of behaviors to that of obtained variance. Obtained variance includes the true variance plus variance due to measurement error and any other factors not taken into consideration in the design (which are included in the error variance). If error variance attributable only to observers is isolated, and if the remaining variance is considered as true variance, the extent to which observer disagreement detracts from reliability of observed behaviors can be estimated (Medley & Mitzel, 1958).
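As a rough illustration of such a ratio, the sketch below computes the standard one-way random-effects intraclass correlation for single ratings, treating variation among observers within a situation as error. This is a swapped-in textbook estimator, not necessarily the design described by Medley and Mitzel, and the frequency table is hypothetical.

```python
def one_way_icc(scores):
    """One-way random-effects intraclass correlation for single ratings.

    scores: list of rows, one row per coded situation and one column per
    observer. Uses (MSB - MSW) / (MSB + (k - 1) * MSW), the usual
    one-way ANOVA estimator.
    """
    n = len(scores)        # number of situations
    k = len(scores[0])     # number of observers
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical category frequencies: four situations coded by two observers.
scores = [[10, 12], [20, 18], [30, 31], [40, 39]]
icc = one_way_icc(scores)
print(round(icc, 3))  # 0.992
```

The coefficient is high here largely because the four hypothetical situations differ widely from one another, which anticipates the problem discussed next.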

The major problem faced in measuring observer agreement in this manner (outside of regular data collection) is that the amount of true variance fluctuates depending on whether the situations selected to be coded are grossly different or highly similar. If the selected situations differ widely on an observation category, then more observer error variance can be tolerated than for situations across which there is little variance in the category.

For example, suppose that an investigator constructs a videotape test with widely varying situations, gives it to two observers, and finds an estimated true variance of 90 (among situations on category X) and an observer error variance of 30. Then

\[ r_{tt} = \frac{90}{90 + 30} = .75. \]

On the other hand, suppose that a test is constructed with highly similar situations (i.e., little variance among situations on category X) and is given to the same two observers. If the true variance was estimated to be 6 and the observers were consistent with their disagreement in the former example (error variance of 30), then

\[ r_{tt} = \frac{6}{6 + 30} = .17. \]

All other things being equal, the discrepancy between the two coefficients in the above example is due to the fluctuation in true variance among situations. Unless several previous studies with the same population and the same observation system have been done, it is difficult to estimate legitimately the amounts of true variance expected in the field for teachers, pupils, and situations. Thus, it is hard to decide how much observer variance can be tolerated.
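The arithmetic of the two cases above can be checked with a few lines of code. The function name is hypothetical; the variance figures are those given in the example.

```python
def agreement_coefficient(true_var, observer_error_var):
    """True variance divided by true variance plus observer error variance."""
    return true_var / (true_var + observer_error_var)


widely_varying = agreement_coefficient(90, 30)  # widely varying situations
highly_similar = agreement_coefficient(6, 30)   # highly similar situations

print(round(widely_varying, 2))  # 0.75
print(round(highly_similar, 2))  # 0.17
```

The observer error variance is identical in both calls; only the true variance among situations changes the coefficient.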

Moreover, even if the amount of expected true variance were known, it would probably be challenging to construct a criterion tape containing only unambiguous examples that has a true variance roughly equivalent to the expected true variance in the field.

Although intraclass correlation coefficients should be used to determine the extent to which observer disagreement limits reliabilities after a study is completed, they are impractical for determining individual observer competencies before a study is begun.

Simple Percentage Agreement.² Simple percentage agreement can be calculated in one of two ways. First, two sets of ratings can be compared on an item by item basis. That is, on a given item the observer either agrees (A) or disagrees (D) with the criterion. Observer agreement for a particular category $i$ (wherever the expert has recorded an $i$) is defined:

\[ P_{o_i} = \frac{\sum A_i}{\sum A_i + \sum D_i} \]

where $\sum A_i$ = total number of agreements for the $i$th category
and $\sum D_i$ = total number of disagreements for the $i$th category

²This and the following observer agreement measures are illustrated by example in Appendix A.

Second, agreement can also be calculated by comparing the two observers' total frequencies for a given category across a number of situations. That is,

\[ P_{o_i} = \frac{f_{1i}}{f_{2i}} \quad \text{or} \quad \frac{f_{2i}}{f_{1i}}, \quad \text{such that } 0 \le P_{o_i} \le 1.00 \]

where $f_{1i}$ = total frequency for the observer on category $i$
and $f_{2i}$ = total frequency for the expert on category $i$

Notice that the first coefficient tends to be more stringent than the second. When using the former, an observer must be correct on almost every item (event) in order to achieve nearly perfect agreement, while on the second he could disagree on some specific items, yet end with total category frequencies very similar to the expert's.
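The two versions of simple percentage agreement can be sketched as follows. The function names and the code sequences are hypothetical, and returning 1.0 when both frequencies are equal (including both zero) is an assumption made here for convenience.

```python
def item_agreement(observer_codes, criterion_codes, category):
    """Agreements / (agreements + disagreements) on the items that the
    criterion coder placed in the given category."""
    observer_on_category = [
        o for o, c in zip(observer_codes, criterion_codes) if c == category
    ]
    agreements = sum(1 for o in observer_on_category if o == category)
    return agreements / len(observer_on_category)


def frequency_agreement(observer_codes, criterion_codes, category):
    """Smaller category total divided by the larger one."""
    f_observer = observer_codes.count(category)
    f_criterion = criterion_codes.count(category)
    if f_observer == f_criterion:
        return 1.0  # assumption: equal totals (even 0 and 0) count as full agreement
    return min(f_observer, f_criterion) / max(f_observer, f_criterion)


# Hypothetical codes for six events.
criterion = ["X", "X", "X", "Y", "Y", "Y"]
observer = ["X", "X", "Y", "X", "Y", "Y"]

by_item = item_agreement(observer, criterion, "X")
by_total = frequency_agreement(observer, criterion, "X")
print(round(by_item, 2))  # 0.67
print(by_total)           # 1.0
```

The observer miscodes two events, yet the category totals match the criterion exactly, so only the item by item coefficient registers the disagreement.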

It can be argued that the second method is more appropriate if data are to be analyzed on the basis of total category frequencies. Alternatively, high agreement when using the first method would almost assure that the observer really knew the system. One limitation of the first method, however, is that it is not always possible to compare ratings on an item by item basis.

There are two other disadvantages to using either of these methods of simple percentage agreement. When low frequencies of some categories occur while other categories occur very often, interpretation of coefficients may be ambiguous. For example, if an observer gets 2 out of 3 correct on category X, and 19 out of 20 correct on category Y, coefficients of .67 and .95 are respectively obtained. Yet the observer deviates by only one tally in each case.

A solution to this problem is to structure the criterion test so that it contains an approximately equal number of unambiguous examples of each category. It is suggested that 10 or more examples of each category be given in order to reduce the likelihood that an observer would obtain high agreement by chance and to ensure that he really knows the system.

A further drawback is that neither of these methods accounts for chance agreement. It is for this reason that Scott developed a $\pi$ coefficient of agreement.

Scott's Coefficient. Scott (1955) argued that measures of simple percentage agreement may be inflated by chance agreement. He therefore proposed a $\pi$ coefficient which estimated the extent to which chance agreement has been exceeded when comparing two observers' scores:

\[ \pi = \frac{P_o - P_e}{1 - P_e} \]

where

\[ P_o = \frac{1}{n} \sum_{i=1}^{C} n_{ii} \]

$n_{ii}$ = the number of items on which the two observers agreed (in the $i$th category)

$n$ = the total number of items coded

and

\[ P_e = \sum_{i=1}^{C} \left( \sum_{j=1}^{J} \frac{n_{ij}}{N_{\cdot\cdot}} \right)^2 = \text{proportion of agreement expected by chance} \]

$n_{ij}$ = the number of codes in the $i$th category for the $j$th observer

$N_{\cdot\cdot}$ = the total number of codes across $C$ categories and $J$ observers

$P_e$ is based on the marginal distributions of categories across all observers, and the same $P_e$ is used for all comparisons. For each pairwise comparison of observers Scott assumed that their proportional distributions of marginals were symmetrical and approximately equal to the average proportional distributions of marginals obtained from all observers. Since it is not always feasible to meet this assumption, Cohen (1960) suggested an alternative procedure of calculating $P_e$.
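A minimal sketch of Scott's coefficient follows, assuming item by item codes are available. The function name, its optional argument for pooling marginals over a larger group of observers, and the code sequences are all hypothetical.

```python
from collections import Counter


def scott_pi(codes_a, codes_b, all_observer_codes=None):
    """Scott's pi for one pair of observers.

    P_e uses category proportions pooled over all observers supplied in
    all_observer_codes (by default, just the pair being compared).
    """
    if all_observer_codes is None:
        all_observer_codes = [codes_a, codes_b]
    n = len(codes_a)
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    pooled = Counter(code for codes in all_observer_codes for code in codes)
    total = sum(pooled.values())
    p_e = sum((count / total) ** 2 for count in pooled.values())
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for ten items in three categories.
codes_a = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
codes_b = [1, 1, 2, 2, 2, 2, 3, 3, 3, 1]

pi = scott_pi(codes_a, codes_b)
print(round(pi, 3))  # 0.699
```

With these data the observers agree on 8 of 10 items (P_o = .80) and the pooled marginals give P_e = .335, so the simple percentage of .80 drops to about .70 once chance is taken into account.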
Cohen's Kappa. Cohen (1960) proposed a $\kappa$ coefficient in which $P_e$, chance agreement, is based on observed marginal distributions rather than expected marginals and requires no assumption that marginals are symmetrical and proportional to known populational marginals. $P_o$, observer agreement, is computed in the same manner as Scott's, but $P_e$ is defined differently:

\[ \kappa = \frac{P_o - P_e}{1 - P_e} \]

\[ P_e = \frac{1}{n^2} \sum_{i=1}^{C} n_{i\cdot}\, n_{\cdot i} \]

where $n_{i\cdot}$ and $n_{\cdot i}$ are observer marginals for each category and $n$ is the number of items coded.

The reader should note that both Scott's $\pi$ and Cohen's $\kappa$ yield one agreement coefficient across two or more categories for each pair of observers. Application of these measures when using only one category is inappropriate.
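For comparison with Scott's coefficient, Cohen's version can be sketched in the same way, with chance agreement taken from each observer's own marginals. The function name and the data are hypothetical.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two observers coding the same items."""
    n = len(codes_a)
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    marginals_a = Counter(codes_a)
    marginals_b = Counter(codes_b)
    categories = set(marginals_a) | set(marginals_b)
    # Chance agreement: product of the two observers' marginals per category.
    p_e = sum(marginals_a[c] * marginals_b[c] for c in categories) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for ten items in three categories.
codes_a = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
codes_b = [1, 1, 2, 2, 2, 2, 3, 3, 3, 1]

kappa = cohen_kappa(codes_a, codes_b)
print(round(kappa, 3))  # 0.701
```

Here P_o = .80 and P_e = (3·3 + 3·4 + 4·3)/100 = .33, so the coefficient is .47/.67. It differs only slightly from a pooled-marginal calculation because the two sets of marginals in this example are nearly symmetrical.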

Light's Extension of $\kappa$. Light (1971) agreed that $\kappa$ was conceptually attractive as a single distance measure of agreement. He preferred, however, to view the $\kappa$ statistic as a distance measure of disagreement under the hypothesis of random agreement. That is,

\[ \kappa = 1 - \frac{d_o}{d_e} \]

where $d_o = 1 - P_o$ = observed proportion of disagreement
and $d_e = 1 - P_e$ = expected proportion of disagreement.

Although Light's formulation is computationally equivalent to Cohen's $\kappa$, it conceptually facilitates extensions to other types of agreement measures such as agreement with more than two observers, comparisons of the joint agreement of several observers with a standard, measures of conditional agreement with two or more observers, and methods of distinguishing patterns of agreement from levels of agreement between two observers. All of these methods require item by item comparisons, and most are computationally complex.
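The computational equivalence of the two formulations follows from a short algebraic step, written out here with $d_o = 1 - P_o$ and $d_e = 1 - P_e$ as defined in the text:

```latex
\begin{align*}
1 - \frac{d_o}{d_e}
  &= 1 - \frac{1 - P_o}{1 - P_e} \\
  &= \frac{(1 - P_e) - (1 - P_o)}{1 - P_e} \\
  &= \frac{P_o - P_e}{1 - P_e}
\end{align*}
```

The last line is the agreement form of the coefficient, so the two expressions always give the same value whenever $P_e < 1$.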

One of these measures, conditional agreement with two observers,
allows one to compare agreement of an observer's score to a criterion score
for "only those items which one observer [the expert] placed in the ith
specific category" (Light, 1971, p. 367). That is,

$$\kappa_{p_i} = \frac{n\, n_{ii} - n_{i+}\, n_{+i}}{n\, n_{i+} - n_{i+}\, n_{+i}}$$

where $n_{ii}$ = number of agreements between the observer
and criterion coder on the ith category

n = total number of items coded

$n_{+i}$ = marginal for the observer on the ith category

$n_{i+}$ = marginal for the criterion coder on the ith
category
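The following Python sketch illustrates conditional agreement for each category. It assumes the standard form of Light's conditional coefficient, in which agreement is conditioned on the criterion coder's (row) marginal; the function name is hypothetical, and the table holds the optimal-case cell counts as read from the computational examples (rows are the expert, columns are the observer).

```python
def conditional_kappa(table, i):
    """Agreement on category i, conditional on the criterion coder (rows)."""
    n = sum(sum(row) for row in table)       # total items coded
    n_ii = table[i][i]                       # agreements on category i
    crit = sum(table[i])                     # n_{i+}: criterion marginal
    obs = sum(row[i] for row in table)       # n_{+i}: observer marginal
    # Undefined when crit == 0 or obs == n; the sketch does not guard this.
    return (n * n_ii - crit * obs) / (n * crit - crit * obs)


OPTIMAL = [
    [2, 2, 0, 0, 0],
    [0, 6, 0, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 1, 4, 5, 0],
    [0, 3, 0, 0, 3],
]

per_category = [conditional_kappa(OPTIMAL, i) for i in range(5)]
```

In this example the second and third categories reach 1.00, because every item the expert placed there was coded identically by the observer, while the remaining categories fall between .40 and .47.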

One advantage of $\kappa_p$, $\kappa$, and $\pi$ is that they can be tested for
significance (see Fleiss, Cohen, & Everitt, 1969; Light, 1971). A limitation of
these coefficients is that item by item comparisons are mandatory if they
are to be used appropriately (Emmer, 1972). Since it is not always possible
or convenient to obtain observational data in this form, Flanders (1967)
has proposed a modification of Scott's $\pi$ using category totals.

Flanders' Modification. Flanders (1967) used the same formula as Scott,
but modified the computation of $P_o$. Correction for chance agreement is
computed separately for each pair of observers based on their observed
marginals rather than using a common expected $P_e$ based on all observers as
Scott did. Here

$$P_o = 1 - \sum_{i=1}^{C} \left| \frac{f_{i1}}{f_{\cdot 1}} - \frac{f_{i2}}{f_{\cdot 2}} \right|$$

where $f_{i1}$ = the frequency for the ith category for observer 1

$f_{i2}$ = the frequency for the ith category for observer 2

$f_{\cdot 1}$ = the total frequency of codes across C categories
for observer 1

$f_{\cdot 2}$ = the total frequency of codes across C categories
for observer 2

and where $P_e$ is computed from the marginals of the two observers
being compared.
The way in which Flanders has calculated $P_o$ is not affected by zero or
low frequencies as is simple percentage agreement on category frequencies.
However, his $\pi_f$ becomes dubious when two sets of ratings correlate highly.
For example, if the first observer simply did not recognize half of the
behavioral events occurring because he was a slow coder, while the second
one did see them, it is possible that the following results could be obtained:

                Observer 1    Observer 2
    Category    Frequency     Frequency
    C1              2             4
    C2              6            12
    C3              4             8
    C4              5            10
    C5              3             6

If Flanders' $\pi_f$ were utilized on these data, a coefficient of 1.00
would result, yet it is evident that the two observers do not perfectly
agree on individual category comparisons. (See Appendix A)

While this type of agreement measure is appropriate if category proportions
or percentages are employed in analysis, it would be a misleading
measure if data were analyzed using category frequencies or three-stage
patterns of behavior as a unit of analysis.
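The slow-coder example can be checked numerically. The Python sketch below assumes that Flanders' $P_o$ is one minus the summed absolute differences between the two observers' category proportions; the function name is hypothetical. Whatever the exact form of $P_e$, a $P_o$ of 1.00 forces the coefficient to 1.00, since $(1 - P_e)/(1 - P_e) = 1$ for any $P_e < 1$.

```python
def proportion_agreement(f1, f2):
    """Assumed form of Flanders' P_o, computed from category proportions."""
    t1, t2 = sum(f1), sum(f2)
    return 1 - sum(abs(a / t1 - b / t2) for a, b in zip(f1, f2))


observer_1 = [2, 6, 4, 5, 3]     # slow coder: half of the events recorded
observer_2 = [4, 12, 8, 10, 6]   # every category frequency is doubled

p_o = proportion_agreement(observer_1, observer_2)

# Raw frequencies tell a different story: observer 1 recorded only half
# as many events in every category.
frequency_ratio = sum(observer_1) / sum(observer_2)
```

The proportions are identical for the two observers, so $P_o$ equals 1.00 even though the raw frequencies differ by a factor of two in every category.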
Garrett (1972) suggested an alternative procedure for calculating $\pi$
which is not biased by highly correlated data. $P_o$ is calculated in a manner
similar to the aforementioned second method of simple percentage agreement.
Her $P_e$ is computed in the same manner as Scott's $P_e$, but she considers
the two observers to be the population each time.

One limitation with Garrett's $\pi_g$ is that it can cause an interpretation
problem for infrequent occurrences of a given category that is similar to
the one discussed for simple percentage agreement measures. It too, like
$\pi_f$, can only be used as a descriptive statistic which cannot be subjected
to a statistical test of significance.

Which Agreement Coefficient is Appropriate?

Although the conditions for measuring agreement have been specified,
and a number of methods of obtaining agreement measures have been presented,
the reader still may be left in a quandary--which observer agreement measure is
most appropriate? Moreover, once an agreement coefficient has been
selected, how large should that coefficient be in order to be acceptable?
There are probably no two best answers to these questions. Based on the
foregoing discussion, the previously specified conditions for measuring
agreement, and the experience of the authors, the following procedures are
recommended:

1. Observer agreement with a criterion. In the case of coding isolated
unambiguous examples when making item by item comparisons, Cohen's $\kappa$ or
Light's $\kappa_p$ seem most desirable. Scott's $\pi$ should generally be avoided,
since it is often difficult to meet the assumption relative to marginal
distributions. Cohen's $\kappa$ appears to be most appropriate for determining
observer agreement on a scale (cluster of categories), providing that the
scale is to be used as the unit of data analysis. On the other hand, if an
individual category is the intended unit of analysis, Light's $\kappa_p$ seems most
appropriate. If a sequential analysis of behavior patterns is planned
(e.g., Collett & Semmel, 1971), $\kappa_p$ can be used for each pattern of interest.
To be conservative, however, an agreement should be counted only if the
order of recorded behaviors is identical for both observers for each instance
of a particular pattern, disregarding any additional elements which are
considered "noise" in the sequential vector.

Interpretation of these coefficients is yet another matter. As Light
(1971) has noted, it is possible to test the significance of $\kappa$ and $\kappa_p$ if
nominal data are used. Emmer (1972) has commented that such significance
tests are seldom used with observer agreement measures--and perhaps for a
good reason. To have demonstrated that an observer agrees with a criterion
at a level significant beyond chance does not guarantee that such agreement
is nearly perfect. It is assumed that, when Medley and Norton (1971)
referred to nearly perfect agreement, they were expecting simple percentage
agreement ($P_o$) of .85 to .90 or greater. If $P_o = .90$ is considered to be a
lower bound for making decisions on adequacy of observer skills, then $\kappa$,
from the experience of the present authors, typically falls around .80 or
greater, while $\kappa_p$ is less predictable.

Therefore, it seems most practical and logical to use the simple
percentage agreement measure if $P_o \geq .85$ is demanded when making item by item
comparisons of unambiguous events. Interpretation is further aided if an
observer is given a criterion test consisting of at least ten examples
of each category (scale, chain) with an approximately equal number of
examples of each unit of analysis.

If item by item comparisons are impossible or impractical, $\kappa_p$ and $\kappa$
can be modified by assuming that the lower of the total frequencies for each
category for the two observers is the number of items upon which they agreed.
While $\kappa$ and $\kappa_p$ (and $\pi$) can be used as a descriptive statistic in this manner,
statistical tests of significance are clearly inappropriate (Emmer, 1972).
Following the same line of reasoning, however, the aforementioned second
method of simple percentage agreement appears most reasonable, if $P_o' \geq .85$
is demanded when comparing total category or scale frequencies of unambiguous
events (assuming approximately equal representation of categories).
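The modification described above can be sketched in Python as follows. This is an illustrative reading, not the authors' worked procedure: it assumes both observers coded the same number of events, takes the lower of the two category totals as the number of agreements, and computes chance agreement from the two observers' own marginals as in Cohen's $\kappa$. The function name is hypothetical; the frequencies are the artificial data used in the appendix examples.

```python
def modified_kappa(f1, f2):
    """Descriptive kappa computed from category totals only."""
    n = sum(f1)
    if n != sum(f2):
        raise ValueError("sketch assumes equal totals for both observers")
    # Assumed agreements per category: the lower of the two totals.
    p_o = sum(min(a, b) for a, b in zip(f1, f2)) / n
    # Chance agreement from the two observers' marginals.
    p_e = sum(a * b for a, b in zip(f1, f2)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


observer_1 = [4, 6, 4, 10, 6]
observer_2 = [2, 12, 8, 5, 3]
kappa_from_totals = modified_kappa(observer_1, observer_2)
```

Because the lower total is the largest number of agreements the marginals allow, the result coincides with the optimal item by item case, which is one reason significance tests on this statistic are inappropriate.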

2. Intraobserver agreement. It was also suggested that a measure of
intraobserver agreement be taken on two observations of the same videotape
segment of a realistic setting. It should be noted that the purpose of
criterion-referenced agreement is to assure a trainer that his coders know
the system nearly perfectly, while the intent of intraobserver agreement is
to demonstrate that coders can apply their discrimination skills in an actual
coding situation.

Since item by item comparisons are usually impractical for the latter
measure, the modification of $\kappa$ discussed above and the two modifications of
$\pi$ are possible choices for an overall measure of intraobserver agreement.
However, since ambiguous events are likely to occur, acceptability of obtained
coefficients should be interpreted more liberally than that for
criterion-referenced agreement.

The present authors have used average simple percentage agreement with
$P_o' \geq .75$ as an acceptable lower limit of intracoder agreement when total
category frequencies are the unit of analysis and there are few low frequency
categories. Alternatively, Flanders' method of calculating $P_o$ is not
affected by low frequencies of categories as is $P_o'$. His coefficient is
therefore preferable when the distribution of category frequencies is unequal
and some are quite low. The problem of a positive bias when pairs of ratings
correlate highly is not as serious with the intraobserver check as it is with the
criterion-referenced test, since ambiguous events are likely; and, more
important, observers tend to see more and code more during the second viewing.

If Flanders' $\pi_f$ is used as an overall intraobserver measure, and if one
accepts $P_o'$ of .75 as minimally acceptable, then $\pi_f$ will typically fall around
.65 to .70 for five- to fifteen-category systems.
INVESTIGATOR ERRORS
While observer agreement is one of the first concerns of an investigator,
reliabilities of observational records are more important. The plural of
reliability is used because more than one intraclass correlation coefficient
can often be calculated for a given set of observational records--particularly
for multi-facet designs. Each coefficient estimates the extent to which
elements of a certain facet in the design can be consistently discriminated
from each other. Components of variance which are considered to be true
sources of variation rather than error variance depend on the reliability
coefficient of interest, which is in turn dependent on the design of the
study and objectives of the investigation (Gleser, Cronbach, and Rajaratnam,
1965). Thus, the design of an observational study as well as selection
of facets which determine true and error variance components are important
considerations, since these factors affect both the nature and degree of
reliabilities of observational data.

Accounting for Situational Factors

In response to these considerations, McGaw, et al. (1972) have
contended that context or situational variables have been neglected or
mistreated in design and analysis of observational studies. In disagreement
with Medley and Mitzel (1963), they have purported that variance in teacher
(and pupil) behavior from situation to situation may be more lawful than
it is random. Although the situations were never clearly defined, it is
assumed that context variables such as subject matter, class size, seating
arrangements, group structure, nature of teacher and pupil task, time of
the day (week), etc. were considered. McGaw and his associates have argued
that situations should be treated as a separate facet in determining
reliabilities. More specifically, assuming that an observable dimension of
teacher behavior is of interest, differences among teachers (T) and
situations (S) are both considered to be true sources of variation. If situational
factors are not taken into account, then variance attributable to situations
is included in measurement error. It is the contention of McGaw, et al.,
that such designs or analyses which have treated situational variance as part
of generic error variance may have limited the reliability with which
teachers (or pupils) were discriminated.

Moreover, the interaction of teachers and situations (TxS) may be
systematic and is likewise treated as a true source of variation. For example,
it is sensible that in a large group social studies lesson with thirty pupils
a teacher's questioning style may be quite different from his/her
questioning style when interacting with one pupil working on an art project.

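Components of variance which are treated as error include variance in
behavior over occasions (O within TxS), variance attributable to rater
differences (J), and other interaction components. That is, the total
variance for their design,

$$\sigma_x^2 = \sigma_t^2 + \sigma_s^2 + \sigma_{ts}^2 + \sigma_e^2$$

where $\sigma_e^2 = \sigma_{o(ts)}^2 + \sigma_j^2 + \sigma_{tj}^2 + \sigma_{sj}^2 + \sigma_{tsj}^2 + \sigma_{oj(ts)}^2$.

Thus, three coefficients of generalizability, or indices of reliability,
may be estimated:

$$\rho_t = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2}, \qquad \rho_s = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}, \qquad \rho_{ts} = \frac{\sigma_{ts}^2}{\sigma_{ts}^2 + \sigma_e^2}$$

[McGaw, et al. (1972), pp. 24-25].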

The advantage of partitioning variance components and reliabilities in
this manner is that data interpretation can be facilitated. For instance,
if both $\rho_t$ and $\rho_s$ turn out to be small, and $\rho_{ts}$ is relatively large, it can
be concluded that while neither differences in teachers (within situations)
nor situations can be clearly discriminated, differences among teachers in
their changes in behavior from one situation to another can be detected.
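The interpretation just described can be illustrated with a short Python sketch. It assumes each coefficient is the ratio of the variance component of interest to that component plus error variance. The variance figures are hypothetical numbers chosen only to reproduce the pattern of small teacher and situation coefficients with a large interaction coefficient; they are not estimates from any study.

```python
def generalizability(var_true, var_error):
    """Ratio of a 'true' variance component to itself plus error variance."""
    return var_true / (var_true + var_error)


# Hypothetical variance components (illustration only).
components = {"t": 0.5, "s": 0.5, "ts": 8.0}
var_error = 2.0

rho = {name: generalizability(v, var_error) for name, v in components.items()}
# rho["t"] and rho["s"] are small while rho["ts"] is large: teachers differ
# mainly in how their behavior changes from one situation to another.
```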

An obvious implication of such an approach is that situation effects
should be considered when designing an observational study. To simplify
data analysis procedures, it would appear advantageous that all teachers be
observed an equal number of times in the same types of situations. Or, if
such a design is unwieldy, all teachers might be observed an equal number of
times in the same situation.

It should be noted that McGaw, et al. (1972) have treated the
situations facet as a random factor in their design. If levels of situations are
selected in a non-random manner, or are considered a fixed effect,
generalizations must be restricted to only those levels rather than to the universe
of situations as defined by McGaw, et al.

Finally, it is necessary that observation schedules include items or
scales for recording situational elements if the situation is to be
considered. Observation instruments used in Project PRIME (Kaufman, Semmel, &
Agard, 1973) and by Medley and Norton (1971) are good examples of instruments
which account for a number of situational variables.

Observer Assignment

It is well known that as the number of items on a test is increased,
the reliability with which the test can discriminate subjects on a given
dimension is likely to increase. The same principle applies to observational
studies. For example, Medley and Mitzel (1958) found that increasing the
number of observers per visit minimally affected reliability, while increasing
the number of visits per classroom substantially increased the reliability
with which teachers were discriminated on five observation scales.

Medley and Mitzel (1963) suggested that a reliability check be performed
in the field (before actual data collection) in order to estimate the number
of visits necessary for an acceptable coefficient. More recently, however,
Medley and Norton (1971) have concluded that such a procedure is of little
value. Rather, as many visits per classroom as possible are desirable in
collecting actual observational data.
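One standard way to formalize the principle that more visits raise reliability is the Spearman-Brown formula. The text does not name this formula, so the Python sketch below is offered only as an illustration; the function names and the single-visit reliability of .30 are hypothetical.

```python
import math


def spearman_brown(rho_1, k):
    """Projected reliability when one visit of reliability rho_1 is repeated k times."""
    return k * rho_1 / (1 + (k - 1) * rho_1)


def visits_needed(rho_1, target):
    """Smallest whole number of visits projected to reach the target reliability."""
    k = target * (1 - rho_1) / (rho_1 * (1 - target))
    return math.ceil(k)


single_visit = 0.30               # hypothetical single-visit reliability
k = visits_needed(single_visit, 0.80)
projected = spearman_brown(single_visit, k)
```

The projection assumes that visits behave like parallel measurements, which unstable classroom behavior may violate. That limitation is consistent with the later advice simply to schedule as many visits per classroom as possible.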

Implications for observer assignment are that independent pairs of
observers should ideally visit each classroom an equal number of times such
that each rater is paired with every other rater equally often. Not only is
analysis simplified by this procedure, but it is also possible to obtain a
direct measure of observer agreement in the classroom by using intraclass
correlation coefficients (Medley & Mitzel, 1958).
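A balanced pairing of this kind can be generated mechanically. The Python sketch below is one possible scheme, not a procedure taken from the cited studies: it cycles through all possible observer pairs so that every pair is used equally often, provided the total number of visits is a multiple of the number of pairs. The function name, observer labels, and classroom labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_schedule(observers, classrooms, visits_per_classroom):
    """Assign observer pairs to visits by cycling through every possible pair."""
    pairs = list(combinations(observers, 2))
    schedule = []
    k = 0
    for _ in range(visits_per_classroom):
        for room in classrooms:
            schedule.append((room, pairs[k % len(pairs)]))
            k += 1
    return schedule


observers = ["A", "B", "C", "D"]          # four observers give six pairs
classrooms = ["room1", "room2", "room3"]
plan = pair_schedule(observers, classrooms, visits_per_classroom=4)

pair_counts = Counter(pair for _, pair in plan)
room_counts = Counter(room for room, _ in plan)
```

With twelve visits and six possible pairs, each pair observes together twice and each classroom is visited four times.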

Sometimes it is impractical to assign pairs of observers to each
classroom. In fact, in order to maximize the number of visits to each
classroom, and to minimize observer effects, one observer per visit is most
desirable providing that "each recorder observes each individual at least
once, and observes every individual the same number of times. This number may
vary across recorders--one recorder may see all individuals twice; another
may see them three times each" [Medley & Norton (1971); Exhibit 2, p. 1].
With this procedure coefficients of observer agreement, stability of
behaviors, and reliabilities of both individuals and groups can be calculated.

Poorly-designed Observation Systems

Another factor which can affect reliability is a poorly-designed
observation system. From the authors' experience, systems with high
inference categories (i.e., poorly defined, or based on extremely subtle
and/or complex cues) cause problems in observer agreement during training.
It logically follows that this can curtail reliability coefficients, and
serious interpretation problems are likely to result.

In order to reduce such occurrences, Medley and Norton (1971) have
stressed that "objectivity is ensured by defining categories so that these
discriminations are based (1) on relatively obvious and easily recognized
cues, and (2) on cues which are minimally dependent on sophisticated knowledge
or on the observer's own set of values (p. 1)." Thiagarajan, M. Semmel,
and D. Semmel (in press) have also delineated a method of concept analysis
which can enhance objectivity of categories during development of a system
or system-training materials.

Two recommendations have been proposed to those who use and/or are
developing observation systems in order to help reduce such errors. First,
Emmer (1972) has suggested that developers of observation systems, like
developers of commercial tests, should include various reliability coefficients
and note the nature of the sample studied when publishing their systems.
Thus, other system users and developers would be able to make informed
choices among observation systems and categories based on their reliability.

Secondly, Thiagarajan (1973) noted that a number of observation systems
are highly dependent on their original developers for purposes of training
coders. If persons other than the originators train observers on the systems,
the intended nature of the systems can be inadvertently distorted. This
could be one reason why results from observational studies using the same
system with different investigators have been inconsistent (Soar, 1972).
In order to reduce this problem, Thiagarajan has suggested that
self-contained, criterion-referenced instructional packages be developed for
observation systems. This procedure could enhance the consistency with
which a system is used.



Observer disagreement is important insofar as it limits the reliabilities
of observational records. Once a study is finished, the extent to which
observer errors detract from reliabilities can be estimated by intraclass
correlation coefficients providing certain design requirements are met
(Medley & Mitzel, 1963). It is too late, however, at this point in time to
retrain observers if observer disagreement is found to be seriously limiting
reliabilities.

Hence, the previous discussion has evolved around methods and conditions
under which observer agreement can be measured so as to minimize such an
occurrence.

It has been concluded that observers should be trained to nearly perfect
agreement with a criterion or expert coder on unambiguous examples of
behavioral categories before actual data collection. Coders should then
be expected to agree on unambiguous events encountered in the field. But
disagreement on ambiguous events observed in the field should also be
expected, since teachers and pupils do not always exhibit behaviors which
neatly fall into predefined observational system categories. Disagreement
on ambiguities may help reflect a more accurate representation of the real
world (Medley and Norton, 1971).

Since the number of ambiguous events occurring in the field cannot
be controlled, a measure of observer agreement in that situation is difficult
to interpret. Rather, the best that can be done is to document that
observers can accurately code unambiguous examples. This can be accomplished
by showing observers a video tape containing only unambiguous examples.

In addition to criterion-related agreement, it was suggested that
measures of intraobserver agreement be obtained by showing a video tape
twice to all observers in which conditions parallel those encountered in the
field. The purpose of an intraobserver agreement measure is to demonstrate
the extent to which each observer can consistently code under observational
circumstances which closely approximate classroom conditions.

While several methods of calculating observer agreement have been
proposed (e.g., Scott, 1955; Cohen, 1960; Flanders, 1967; Light, 1971;
Garrett, 1972), little emphasis has been placed on interpretation of observer
agreement measures. Due to lack of existing guidelines for making decisions
on adequacy of observer training, relationships among various agreement
measures were discussed. It was concluded that simple percentage agreement
$\geq .85$ for each unit of data analysis was acceptable for a criterion-related
measure of agreement on unambiguous videotaped examples. An overall
proportion of agreement $\geq .75$ was also recommended for an intraobserver measure on
a videotape representative of realistic classroom coding conditions.

In addition, the necessity of calculating agreement coefficients with
the same type(s) of data (e.g., category frequencies, two-stage patterns,
or scales) that are used in analysis of actual data collected in the study
was emphasized.

While criterion-related and intraobserver agreement measures have been
recommended for both before and during a study, these measures should not be
used as evidence of observer agreement in the actual classroom. Rather, these
are measures to assist an investigator in documenting adequacy of
observational skills. The purpose of such efforts is to minimize the possibility
that observers are primarily responsible for potentially unreliable
observational data.

After a study is finished, reliabilities of observational data and
coefficients of stability and observer agreement should be calculated by
using intraclass correlation coefficients. It was emphasized that there are
many types of reliabilities, each depending on the design of the study and
on how true and error variance components are partitioned.

Since teachers and pupils may behave differently in different situations,
it was suggested that treatment of situations and subject X situation
interactions as error variance in past studies may have limited the
reliability with which teachers and pupils were discriminated. As a result,
identification of classroom process variables that relate to pupil growth
may have been obfuscated. It was recommended that situational factors be
included in observation systems in an attempt to reduce the large amounts
of error variance typically attributable to instability of human behavior.
The identification of significant situational factors remains yet to be
determined empirically, however.

It was also emphasized that appropriate methods of observer assignment
can make it possible to determine the extent to which observer disagreement
limits reliabilities relative to other sources of error such as instability
of behavior across occasions. Finally, it was mentioned that poorly designed
observation systems can also curtail reliabilities.

Considering the variety of different types of errors that can enter
into observational studies, it is not surprising that few, if any,
relationships among classroom process variables and pupil outcome measures have
been yet established.
Two kinds of simple percentage agreement, Scott's $\pi$ (1955), Cohen's $\kappa$
(1960), Light's (1971) extension of $\kappa$, Flanders' (1967) modification of
$\pi$, and two other measures of agreement between an observer and an expert
will be considered respectively. In order to provide a thread of
continuity among these methods, the same artificial data will be used in
computational examples, although the data will appear in different "forms."
Necessary assumptions, the required "form" of the data, the computational
formula, examples, and advantages and limitations of each method will be
given.

Assume that two observers coded a total of thirty events using a five
category observation system. The categories used were:

    C1 = Teacher lecture
    C2 = Teacher comprehension question
    C3 = Teacher convergent question
    C4 = Teacher divergent question
    C5 = Teacher evaluative question

Following are some artificial data. The total frequencies of each
category recorded during the thirty events are given for each observer:

              Obs. 1    Obs. 2
              (f_i1)    (f_i2)
    C1           4         2
    C2           6        12
    C3           4         8
    C4          10         5
    C5           6         3

Now, if observer agreement was measured on an item by item basis,
there are two extreme cases, optimal and minimal agreement, in which the
two observers' codes could result in the above total frequencies (marginals).

Insert Tables 1 & 2 Here



TABLE 1. An Optimal Case of Agreement on
Individual Items Given Fixed Marginals

(A = agree, D = disagree)
[The event-by-event Observer 1 and Observer 2 code columns are not legible in this copy.]

    Event  Agreement     Event  Agreement     Event  Agreement
      1       A           11       A           21       A
      2       A           12       D           22       D
      3       D           13       A           23       D
      4       A           14       D           24       D
      5       A           15       A           25       D
      6       A           16       A           26       D
      7       A           17       A           27       A
      8       A           18       A           28       A
      9       D           19       A           29       D
     10       A           20       A           30       A

    MARGINALS    Observer 1    Observer 2
    C1                4             2
    C2                6            12
    C3                4             8
    C4               10             5
    C5                6             3

ΣA = 20; ΣD = 10
TABLE 2. A Minimal Case of Agreement on
Individual Items Given Fixed Marginals

(D = disagree)
[The event-by-event Observer 1 and Observer 2 code columns are not legible in this copy.]

    Events 1 through 30: D on every event

    MARGINALS    Observer 1    Observer 2
    C1                4             2
    C2                6            12
    C3                4             8
    C4               10             5
    C5                6             3

ΣA = 0; ΣD = 30
Following are examples of different methods of calculating observer
agreement using these data.

1. SIMPLE PERCENTAGE AGREEMENT: NOMINAL DATA

1.1. Assumption: Observational records must be analyzed on an item x
item basis. On a given item or event observers either agree (A)
or disagree (D).

1.2. Computational Formula

$$P_o = \frac{\Sigma A}{\Sigma A + \Sigma D}$$

where ΣA = total number of agreeing pairs,
and ΣD = total number of disagreeing pairs.
Range: $0 \leq P_o \leq 1.00$

1.3. Examples

Optimal Case: $P_o = \frac{20}{20 + 10} = \frac{20}{30} = .67$

Minimal Case: $P_o = \frac{0}{0 + 30} = .00$
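The two examples can be reproduced with a few lines of Python. In this sketch the function name is hypothetical, and the optimal-case string is the Agreement column of Table 1 as it reads in this copy (an assumption about the legible entries), with one letter per event.

```python
def simple_agreement(flags):
    """Proportion of agreeing pairs from a sequence of 'A'/'D' flags."""
    agree = flags.count("A")
    disagree = flags.count("D")
    return agree / (agree + disagree)


# Agreement column of Table 1 (optimal case), events 1 through 30.
OPTIMAL_FLAGS = "AADAAAAADAADADAAAAAAADDDDDAADA"
# Table 2 (minimal case): the observers disagree on every event.
MINIMAL_FLAGS = "D" * 30

p_o_optimal = simple_agreement(OPTIMAL_FLAGS)
p_o_minimal = simple_agreement(MINIMAL_FLAGS)
```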

1.4. Advantages 

1.4.1. Computationally simple. 

1.4.2. Discriminates between optimal and minimal cases. 

1.5. Disadvantages

1.5.1. Does not account for "chance" agreement.

1.5.2. Can only be done for two observers at a time. 

1.5.3. Often impractical to obtain nominal form observational 
data. 

2. SIMPLE PERCENTAGE AGREEMENT: FREQUENCY DATA 

2.1. Assumption: Observational data is analyzed on the basis of total
tallies in each category. For a given category it is assumed
that the lower score of two observers is the number of times
they agreed, and the difference between scores is the number
of times they disagreed.
2.2. Computational Formula

$$P_i' = \frac{f_{i1}}{f_{i2}} \ \text{or} \ \frac{f_{i2}}{f_{i1}}, \ \text{such that} \ 0 \leq P_i' \leq 1.00$$

where $f_{i1}$ = score for observer 1 on the ith category
and $f_{i2}$ = score for observer 2 on the same ith category

Also, $P_o'$ = average percent agreement $= \frac{1}{C} \sum_{i=1}^{C} P_i'$

Range: $0 \leq P_o' \leq 1.00$

2.3. Examples

Optimal Case: $P_2' = \frac{6}{12} = .50$

$P_o' = \frac{1}{5} \left( \frac{2}{4} + \frac{6}{12} + \frac{4}{8} + \frac{5}{10} + \frac{3}{6} \right) = .50$

Minimal Case: Same as optimal case
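A brief Python sketch of the frequency-based measure follows. The function name is hypothetical, and the sketch assumes that no category has a zero total for both observers, since the ratio is undefined in that case.

```python
def category_agreement(f1, f2):
    """Lower category total divided by the higher one (assumes max > 0)."""
    return min(f1, f2) / max(f1, f2)


observer_1 = [4, 6, 4, 10, 6]
observer_2 = [2, 12, 8, 5, 3]

per_category = [category_agreement(a, b) for a, b in zip(observer_1, observer_2)]
average_agreement = sum(per_category) / len(per_category)

# Equal absolute differences give very different values at low frequencies.
low_frequency = category_agreement(1, 2)
high_frequency = category_agreement(19, 20)
```

Only the category totals enter the calculation, so the optimal and minimal item by item cases, which share the same marginals, cannot be told apart.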
2.4. Advantages

2.4.1. Computationally simple

2.4.2. Agreement for specific categories or a group of
categories can be calculated.

2.5. Disadvantages

2.5.1. Does not differentiate between optimal and minimal cases

2.5.2. Can only be done for two observers at a time.

2.5.3. Does not account for "chance" agreement

2.5.4. May be affected by zero or low frequencies. For
example:

              Obs 1    Obs 2    Difference    P_i'
    C1           1        2         1         .50
    C2          19       20         1         .95



3. SCOTT'S $\pi$ (1955)

Scott argued that measures of simple percentage agreement may be
inflated by "chance" agreement. He therefore proposed a $\pi$ coefficient
which estimated the extent to which chance agreement has been exceeded.

3.1. Assumptions

3.1.1. Nominal, rather than frequency, data must be used. That is,
on a given item two observers either agree or disagree.

3.1.2. Both observers must have identical marginal distributions
of category proportions, equal to known populational values
(Light, 1971, p. 367).

3.2. Using Contingency Tables

Light (1971) suggested that a contingency table is conceptually
useful in understanding Scott's $\pi$, as well as Cohen's $\kappa$, and Light's
extension of $\kappa$ for measures of conditional agreement, agreement
among more than two observers with each other and with a criterion,
and measures of patterns of agreement. Therefore, a C x C
contingency table will be constructed for purposes of illustrating
the data.

                                 Observer 2
                       C1      C2      C3      C4      C5
                C1                                              n_{1+}
                C2                                              n_{2+}
    Observer 1  C3                     (a)                      n_{3+}
                C4             (b)                              n_{4+}
                C5                                              n_{5+}
    marginal for
    Obs. 2           n_{+1}  n_{+2}  n_{+3}  n_{+4}  n_{+5}       N

    N = total number of items

Figure 1. A 5x5 Contingency Table



For each behavioral event a tally can be made in the contingency
table in Figure 1 showing agreement or disagreement and its nature.
For example, suppose that for item (a) both observers recorded category
#3. An entry would be tallied on the main diagonal (C3, C3), showing
that the two observers agreed on this item. For item (b) the first
observer recorded category #4, while the second observer saw it as
category #2. An entry would be tallied off the main diagonal (C4, C2),
showing how they disagreed.
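The tallying procedure can be sketched in Python as follows. The function name is hypothetical; categories are numbered from 1 as in the text, with rows for the first observer and columns for the second.

```python
def contingency_table(codes_1, codes_2, n_categories):
    """Tally paired codes into a square table (rows: observer 1)."""
    table = [[0] * n_categories for _ in range(n_categories)]
    for a, b in zip(codes_1, codes_2):
        table[a - 1][b - 1] += 1      # categories are numbered from 1
    return table


# Item (a): both observers record category 3 (main diagonal).
# Item (b): observer 1 records category 4, observer 2 records category 2.
table = contingency_table([3, 4], [3, 2], n_categories=5)
diagonal_total = sum(table[i][i] for i in range(5))
```

Summing the main diagonal gives the number of agreeing pairs, and the row and column sums give the two observers' marginals.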
For the optimal case the contingency table is:

                              Observer 2
                       C1    C2    C3    C4    C5    n_{i+}
                C1      2     2     0     0     0       4
    Observer 1  C2      0     6     0     0     0       6
    (Expert)    C3      0     0     4     0     0       4
                C4      0     1     4     5     0      10
                C5      0     3     0     0     3       6
            n_{+i}      2    12     8     5     3      30





For the minimal case the contingency table is:

                              Observer 2
                       C1    C2    C3    C4    C5    n_{i+}
                C1      0     4     0     0     0       4
    Observer 1  C2      1     0     2     1     2       6
    (Expert)    C3      0     0     0     4     0       4
                C4      1     8     0     0     1      10
                C5      0     0     6     0     0       6
            n_{+i}      2    12     8     5     3      30
3.3. Computational Formula

$$\pi = \frac{P_o - P_e}{1 - P_e}$$

where $P_o = \frac{1}{n} \sum_{i=1}^{C} n_{ii}$ = proportion of agreeing pairs

$n_{ii}$ = the number of items on which the two observers
agreed for the ith category (main diagonal)

n = total number of items coded

and $P_e = \sum_{i=1}^{C} \left( \frac{n_{i\cdot}}{N_{\cdot\cdot}} \right)^2$, with $n_{i\cdot} = \sum_{j=1}^{J} n_{ij}$

$n_{ij}$ = the marginal for observer j for the ith
category (there are J observers)

$N_{\cdot\cdot}$ = the total number of codes for all observers



3.4. Examples 



Since the same $P_e$ is based on the population of observers and
is used for each pairwise comparison of observers, it needs to be
computed only once. The observers' marginals are:










        Obs. 1    Obs. 2    Obs. 3    Obs. 4    n_i.
C1         4         2         6         7       19
C2         6        12        10        11       39
C3         4         8         2         1       15
C4        10         5         4         3       22
C5         6         3         8         8       25

P_e = .22                              N.. = 120
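
As a check on the pooled chance-agreement term, the following Python sketch recomputes P_e from the four observers' marginals in Section 3.4. The names `MARGINALS` and `pooled_chance_agreement` are chosen here for illustration only.

```python
# Category marginals for Observers 1-4 (rows C1..C5), from Section 3.4.
MARGINALS = [
    [4, 2, 6, 7],
    [6, 12, 10, 11],
    [4, 8, 2, 1],
    [10, 5, 4, 3],
    [6, 3, 8, 8],
]


def pooled_chance_agreement(marginals):
    """Sum of squared pooled category proportions across all observers."""
    total_codes = sum(sum(row) for row in marginals)  # N..
    return sum((sum(row) / total_codes) ** 2 for row in marginals)


p_e = pooled_chance_agreement(MARGINALS)
print(round(p_e, 2))
```

Each row sum is the pooled marginal for one category, so the function squares the five population proportions and adds them.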



According to assumption 3.1.2., the proportional distributions
of marginals for each pairwise comparison of observers should be
symmetrical, and they in turn should be equal to the known population
proportions (population = J = 4 observers).





        Population      Observer 1      Observer 2
        [n_i./N..]      [n_i1/n_.1]     [n_i2/n_.2]
C1         .16             .13             .07
C2         .33             .20             .40
C3         .12             .13             .27
C4         .18             .33             .17
C5         .21             .20             .10

For observers 1 and 2 it can be seen that their marginal
distributions of category proportions are not very symmetrical, nor
are they equal to the population proportions. Thus, Scott's π should
not be used legitimately as a statistic with these data. For
purposes of illustration it will be computed in spite of the
apparent violation of assumption 3.1.2.

Optimal Case

$$P_o = \frac{1}{30}(2 + 6 + 4 + 5 + 3) = \frac{20}{30} = .67$$

$$\pi = \frac{P_o - P_e}{1 - P_e} = \frac{.67 - .22}{1.00 - .22} = \frac{.45}{.78} = .58$$

Minimal Case

$$P_o = \frac{1}{30}(0 + 0 + 0 + 0 + 0) = 0.00$$

$$\pi = \frac{P_o - P_e}{1 - P_e} = \frac{0.0 - .22}{1.00 - .22} = -.28$$
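
The two cases can also be reproduced programmatically. The sketch below is illustrative only: it takes the contingency tables from Section 3.2 and the rounded population value P_e = .22 as inputs, and the function name `scott_pi` is an assumption of this example. Because P_o is carried at full precision rather than rounded to two places, the optimal value may differ from a hand calculation in the second decimal place.

```python
# Contingency tables from Section 3.2 (rows: Observer 1, columns: Observer 2).
OPTIMAL = [
    [2, 2, 0, 0, 0],
    [0, 6, 0, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 1, 4, 5, 0],
    [0, 3, 0, 0, 3],
]
MINIMAL = [
    [0, 4, 0, 0, 0],
    [1, 0, 2, 1, 2],
    [0, 0, 0, 4, 0],
    [1, 8, 0, 0, 1],
    [0, 0, 6, 0, 0],
]


def scott_pi(table, p_e):
    """Scott's pi for one pair of observers, given a population-based p_e."""
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(len(table))) / n
    return (p_o - p_e) / (1 - p_e)


pi_optimal = scott_pi(OPTIMAL, 0.22)
pi_minimal = scott_pi(MINIMAL, 0.22)
print(round(pi_optimal, 3), round(pi_minimal, 3))
```

The same P_e is passed to both calls because it is a property of the observer population, not of the particular pair being compared.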

3.5. Advantages

3.5.1. Corrects for chance agreement

3.5.2. Relatively simple to compute

3.5.3. Can be tested for significance if assumptions are met

3.5.4. Discriminates between optimal and minimal case

3.6. Disadvantages

3.6.1. Can only be used with two observers at a time

3.6.2. Assumes that observed marginals are symmetrical and
approximately equal to known population values, which may
not always be the case.



4. COHEN'S κ (1960)

4.1. Assumption: Nominal data are mandated. Identical marginals are not
required, since chance agreement is based upon observed marginals.



4.2. Computational Formula

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where $P_o = \frac{1}{N}\sum_{i=1}^{C} n_{ii}$ = proportion of agreeing pairs

$P_e = \frac{1}{N^2}\sum_{i=1}^{C} (n_{i+})(n_{+i})$ = agreement expected by chance alone

$n_{ii}$ = number of items on which observers agreed for the ith
category

$N$ = total number of items coded
$n_{i+}$ = marginal for observer 1 for the ith category
$n_{+i}$ = marginal for observer 2 for the ith category

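4.3. Examples

Optimal Case (see contingency tables in Section 3.2.)

$$P_o = \frac{1}{30}(2 + 6 + 4 + 5 + 3) = \frac{20}{30} = .67$$

$$P_e = \frac{1}{30^2}(4 \times 2 + 12 \times 6 + 8 \times 4 + 10 \times 5 + 6 \times 3) = \frac{180}{900} = .20$$

$$\kappa = \frac{.67 - .20}{1.00 - .20} = \frac{.47}{.80} = .59$$

Minimal Case

$$P_o = \frac{1}{30}(0 + 0 + 0 + 0 + 0) = .00$$

$$P_e = \frac{1}{30^2}(4 \times 2 + 12 \times 6 + 8 \times 4 + 10 \times 5 + 6 \times 3) = \frac{180}{900} = .20$$

$$\kappa = \frac{0 - .20}{1.00 - .20} = \frac{-.20}{.80} = -.25$$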
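
For readers who want to verify these figures, here is a minimal Python sketch of the κ computation, with chance agreement taken from the observed row and column marginals. The function name `cohen_kappa` is illustrative; the tables are those of Section 3.2. Intermediate values are not rounded, so the optimal-case result can differ from a hand calculation in the second decimal place.

```python
def cohen_kappa(table):
    """Cohen's kappa from a square contingency table (rows: obs. 1, cols: obs. 2)."""
    c = len(table)
    n = sum(sum(row) for row in table)
    row_marginals = [sum(row) for row in table]
    col_marginals = [sum(table[r][j] for r in range(c)) for j in range(c)]
    p_o = sum(table[i][i] for i in range(c)) / n
    p_e = sum(r * k for r, k in zip(row_marginals, col_marginals)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


OPTIMAL = [
    [2, 2, 0, 0, 0],
    [0, 6, 0, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 1, 4, 5, 0],
    [0, 3, 0, 0, 3],
]
MINIMAL = [
    [0, 4, 0, 0, 0],
    [1, 0, 2, 1, 2],
    [0, 0, 0, 4, 0],
    [1, 8, 0, 0, 1],
    [0, 0, 6, 0, 0],
]

kappa_optimal = cohen_kappa(OPTIMAL)
kappa_minimal = cohen_kappa(MINIMAL)
print(round(kappa_optimal, 3), round(kappa_minimal, 3))
```

Both tables share the same marginals, so they share the same chance term (180/900) and differ only in the diagonal.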

4.4. Advantages

4.4.1. Relatively simple to compute

4.4.2. Accounts for "chance" agreement

4.4.3. Possible to interpret statistically [see Cohen (1960)
and Light (1971)]

4.4.4. Discriminates between optimal and minimal case

4.5. Disadvantages

4.5.1. Only good for two observers at a time

4.5.2. Cannot be legitimately used with quantitative data

4.5.3. Only measures overall agreement. Does not indicate agreement
for specific categories, although inspecting the off-diagonal
cells in the contingency table can be helpful.




5. LIGHT (1971): EXTENSIONS OF κ

Light agreed that κ was conceptually attractive as a simple distance
measure of agreement. He preferred, however, to view the κ statistic as
a distance measure of disagreement under the hypothesis of random
agreement. That is:

$$\kappa = 1 - \frac{d_o}{d_e}$$

where $d_o = 1 - P_o$ = observed proportion of disagreement
and $d_e = 1 - P_e$ = expected proportion of disagreement

"Thus κ becomes a ratio of measures of distance, or disagreements, between
two observers, where distances are measured by counting a series of
ones and zeros. These distance measures are simply a function of the
numbers of agreeing versus disagreeing pairs in the 2n total responses."
(p. 367). It should be noted that Light's κ is computationally
equivalent to Cohen's κ.
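
The equivalence follows in one line when the disagreement proportions are written out as $d_o = 1 - P_o$ and $d_e = 1 - P_e$:

$$\kappa = 1 - \frac{d_o}{d_e} = 1 - \frac{1 - P_o}{1 - P_e} = \frac{(1 - P_e) - (1 - P_o)}{1 - P_e} = \frac{P_o - P_e}{1 - P_e}$$

The right-hand side is Cohen's computational formula from Section 4.2, so the two views differ only in interpretation, not in value.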

Viewing κ in these terms, Light has extended it to:

1) conditional measures of agreement level with two observers,

2) measures of agreement among more than two observers,

3) measures of conditional agreement with more than two
observers,

4) comparison of the joint agreement of several observers with
a standard,

and 5) a method of distinguishing patterns of agreement from levels
of agreement between two observers.

Due to the computational complexity of most of these extensions of κ,
they will not be discussed here. The interested reader is referred to
Light (1971, pp. 367-376).

6. LIGHT (1971): CONDITIONAL AGREEMENT, κ_pi

6.1. Assumptions

The same assumptions hold as do those for Cohen's κ, except that
κ_pi allows the comparison of each observer's score to a criterion
(expert) score for only those items which the criterion placed in
the ith specific category.
6.2. Computational Formula

$$\kappa_{p_i} = 1 - \frac{1 - \frac{n_{ii}}{n_{i+}}}{1 - \frac{n_{+i}}{n}}$$

where $n_{ii}$ = number of agreements between the observer and the
criterion coder on the ith category

$n$ = total number of items coded

$n_{+i}$ = marginal for the observer on the ith category

$n_{i+}$ = marginal for the criterion coder on the ith category

6.3. Example 

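Optimal Case (for category 3; observer 1 is considered
the expert)

$$\kappa_{p_3} = 1 - \frac{1 - \frac{4}{4}}{1 - \frac{8}{30}} = 1.00$$

Minimal Case (for category 3)

$$\kappa_{p_3} = 1 - \frac{1 - \frac{0}{4}}{1 - \frac{8}{30}} = -.36$$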
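
A hedged Python sketch of the conditional agreement computation for category 3 follows. It assumes the form of Light's conditional κ in which the observer's hit rate on the items the criterion placed in category i is corrected by the observer's own marginal proportion for that category. The function name `conditional_kappa` and the row/column convention (rows for the criterion coder, columns for the observer) are assumptions of this example.

```python
def conditional_kappa(table, category):
    """Conditional agreement for one category (1-based).

    Rows are the criterion (expert) coder, columns the observer.
    """
    i = category - 1
    n = sum(sum(row) for row in table)
    criterion_marginal = sum(table[i])                 # items the expert put in category i
    observer_marginal = sum(row[i] for row in table)   # items the observer put in category i
    hit_rate = table[i][i] / criterion_marginal
    chance = observer_marginal / n
    return 1 - (1 - hit_rate) / (1 - chance)


OPTIMAL = [
    [2, 2, 0, 0, 0],
    [0, 6, 0, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 1, 4, 5, 0],
    [0, 3, 0, 0, 3],
]
MINIMAL = [
    [0, 4, 0, 0, 0],
    [1, 0, 2, 1, 2],
    [0, 0, 0, 4, 0],
    [1, 8, 0, 0, 1],
    [0, 0, 6, 0, 0],
]

kp3_optimal = conditional_kappa(OPTIMAL, 3)
kp3_minimal = conditional_kappa(MINIMAL, 3)
print(round(kp3_optimal, 2), round(kp3_minimal, 2))
```

Calling the function once per category gives the category-by-category picture that the overall κ cannot provide.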




6.4. Advantages 

6.4.1. Can compare specific categories 

6.4.2. Can test for significance 

6.4.3. Accounts for chance agreement

6.4.4. Discriminates between optimal and minimal cases

6.5. Disadvantages 

6.5.1. Cannot be used legitimately with quantitative data 

7. FLANDERS' (1967) MODIFICATION OF SCOTT'S π

Flanders used the general formula of π (and κ) but modified the
computation of $P_o$ for use with quantitative data. He also computed
chance agreement similar to Scott's $P_e$ but used the average proportion
per category rather than known population proportions.
where $P_{o_f}$ =

and $P_{e_f} = \sum_{i=1}^{C}\left(\frac{\frac{f_{i1}}{f_{.1}} + \frac{f_{i2}}{f_{.2}}}{2}\right)^2$

$f_{i1}$ = frequency in the ith category for observer 1
$f_{i2}$ = frequency in the ith category for observer 2

$f_{.1}$ = total number of tallies for C categories for observer 1
$f_{.2}$ = total number of tallies for C categories for observer 2

7.2. Example: (See Table 3)

7.3. Advantages

7.3.1. Uses frequency rather than nominal comparisons

7.3.2. Relatively simple to compute

7.3.3. $P_{o_f}$ is unaffected by low or zero frequencies in some
categories, unlike average simple percentage agreement for
frequency data

7.4. Disadvantages

7.4.1. Cannot be used legitimately for specific category agreement

7.4.2. Cannot be interpreted statistically

7.4.3. If two observational records correlate positively, Flanders'
π may overestimate observer agreement. This is most likely
to happen when using an event-recording system rather than a
unit-time recording system. For example, if one observer
only detected half as many events as another, the following
results could occur:

          Obs. 1    Obs. 2
C1           2         4
C2           6        12
C3           4         8
C4           5        10
C5           3         6
Σf_i        20        40

Flanders' method would yield a coefficient of 1.00 with the
above data. Yet it is apparent that the two observers do
not "agree" on within-category comparisons. This would not
necessarily be undesirable if one were to analyze the data as
proportions of frequencies in categories to the total
number of recorded events.

However, if frequencies in each category were compared
during data analysis, Flanders' method would yield a
misleading overestimate of observer agreement.
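
The reason a proportion-based coefficient cannot see this disagreement can be shown with a short Python sketch. It does not implement Flanders' coefficient itself. It only compares the two observers' category proportions and raw frequencies for the data in 7.4.3; the variable and function names are illustrative.

```python
obs1 = [2, 6, 4, 5, 3]     # frequencies for categories C1..C5, observer 1
obs2 = [4, 12, 8, 10, 6]   # observer 2 detected twice as many events


def proportions(freqs):
    """Each category's share of one observer's total tallies."""
    total = sum(freqs)
    return [f / total for f in freqs]


p1, p2 = proportions(obs1), proportions(obs2)

# Largest difference between the two observers' category proportions.
max_proportion_gap = max(abs(a - b) for a, b in zip(p1, p2))

# Ratio of raw frequencies, category by category.
raw_ratios = [b / a for a, b in zip(obs1, obs2)]
print(max_proportion_gap, raw_ratios)
```

The proportions coincide in every category while every raw frequency differs by a factor of two, which is why any index built only on proportions reports perfect agreement here.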

8. A MODIFICATION OF SCOTT'S π USING FREQUENCY DATA

In order to overcome the bias of Flanders' method when data are
positively correlated and when using event recording systems,
Garrett (1972) suggested a modification of Scott's π that is
unaffected by correlated observational records.

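8.1. Computational Formula

$$\pi_g = \frac{P_{o_g} - P_{e_g}}{1 - P_{e_g}}$$

where $P_{o_g} = \frac{1}{C}\sum_{i=1}^{C}\frac{\min\{f_{i1}, f_{i2}\}}{\max\{f_{i1}, f_{i2}\}}$

and $P_{e_g} = \sum_{i=1}^{C}\left(\frac{f_{i1} + f_{i2}}{f_{.1} + f_{.2}}\right)^2$

$P_{o_g}$ is equivalent to P' in Section 2. $P_{e_g}$ is equivalent to
Scott's $P_e$ in Section 3.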
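
A minimal sketch of this modification is given below, under two stated assumptions: that the observed-agreement term is the mean, over categories, of the smaller frequency divided by the larger, and that the chance term is the sum of squared pooled category proportions. The function name `garrett_pi` is hypothetical. The data are the correlated frequencies from 7.4.3.

```python
def garrett_pi(freqs1, freqs2):
    """Frequency-based modification of Scott's pi (sketch under stated assumptions).

    A category with zero frequency for both observers makes the ratio
    undefined and raises ZeroDivisionError here.
    """
    c = len(freqs1)
    # Assumed observed agreement: mean of smaller/larger frequency per category.
    p_o = sum(min(a, b) / max(a, b) for a, b in zip(freqs1, freqs2)) / c
    # Assumed chance agreement: squared pooled proportion per category.
    grand_total = sum(freqs1) + sum(freqs2)
    p_e = sum(((a + b) / grand_total) ** 2 for a, b in zip(freqs1, freqs2))
    return (p_o - p_e) / (1 - p_e)


obs1 = [2, 6, 4, 5, 3]
obs2 = [4, 12, 8, 10, 6]
pi_g = garrett_pi(obs1, obs2)
print(round(pi_g, 3))
```

With these data each category ratio is one half, so the coefficient falls well below 1.00 even though the two observers' proportions are identical.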



8.2. Example: (See Table 4)

8.3. Advantages

8.3.1. Computationally simple

8.3.2. Unaffected by correlated ratings

8.3.3. Uses frequency rather than nominal data, which is helpful
when item x item comparisons are impractical or impossible
to obtain.

8.3.4. Corrects for chance agreement, based on observed marginals.

8.4. Disadvantages

8.4.1. Measures overall agreement rather than specific category
agreement

8.4.2. Affected by low or zero frequencies of categories

8.4.3. Cannot be interpreted statistically